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^ ' Abstract 

r^^ ' 

i-^ ' Estimation of Distribution Algoritlims have been proposed as a new paradigm 

(<-^ , for evolutionary optimization. This paper focuses on the parallelization of Esti- 

VQ ' mation of Distribution Algorithms. More specifically, the paper discusses how to 

t"*^ ' predict performance of parallel Mixed Bayesian Optimization Algorithm (MBOA) 

^^ , that is based on parallel construction of Bayesian networks with decision trees. 

c7^ ' We determine the time complexity of parallel Mixed Bayesian Optimization Algo- 

ls I rithm and compare this complexity with experimental results obtained by solving 

^ ■ the spin glass optimization problem. The empirical results fit well the theoreti- 

cal time complexity, so the scalability and efficiency of parallel Mixed Bayesian 
Optimization Algorithm for unknown instances of spin glass benchmarks can be 
ji^ ■ predicted. Furthermore, we derive the guidelines that can be used to design effec- 

^ ' tive parallel Estimation of Distribution Algorithms with the speedup proportional 

to the number of variables in the problem. 

1 Introduction 

Estimation of distribution algorithms (EDAs) Q] |2l , also called probabilistic model- 
building algorithms (PMBGAs) (3) and iterated density estimation evolutionary algo- 
rithms (IDE As) f4l, are stochastic techniques that combine genetic and evolutionary 
algorithms, machine learning, and statistics. EDAs are able to effectively discover and 
mix the building blocks of partial solutions, provided that the allowed complexity of 
probabilistic model is relevant to optimized problem. In contrast to genetic algorithms, 
their performance is not affected by the ordering of parameters. The relations between 



parameters of solved problem are discovered by machine-learning methods and incor- 
porated into the constructed probabilistic model. 

The necessary condition of successful EDA is the generality of its probabilistic 
model. The well known Bayesian Optimization Algorithm (BOA) |5 1 uses a very gen- 
eral probabilistic model - a Bayesian network - to encode the relations between discrete 
parameters of the solved problem. Local probability distributions in the Bayesian net- 
work in the original BOA were stored using full conditional probability tables. More 
efficient representation of the local distributions was used in the hierarchical Bayesian 
Optimization Algorithm (hBOA) |6|, which uses decision trees or graphs to represent 
these local distributions. 

The Mixed Bayesian Optimization Algorithm (MBOA) 1 7 1 is able to deal with dis- 
crete and continuous parameters simultaneously by using a Gaussian kernel model to 
capture local distributions of continuous parameters. Its detailed principles are de- 
scribed in 1 8 1. In binary domain it can be seen as a variant of hBOA, but its implemen- 
tation emphasizes the possibility to efficiently perform model building in parallel. The 
implementation of parallel MBOA was first simulated in the TRANSIM tool |9| and 
its real implementation was reported in 1 10|. 

This paper analyzes the time complexity of parallel Mixed Bayesian Optimization 
Algorithm for binary domain and compares this complexity with experimental results 
obtained by solving spin glass optimization problem. The following Section defines the 
spin glass benchmark and motivates for its parallel solving. Section |3] introduces the 
main principles of MBOA, and Sectionl^analyzes the time complexity of each MBOA 
part. Section|5]provides the scalability analysis of parallel MBOA, identifies the main 
parts that have to be performed in parallel and derives the guidelines that can be used 
to design effective parallel MBOA. The experimental results are presented in Section 
|6land conclusions are provided in Section^ 

Note that this paper does not investigate the relation between problem size n and 
the minimal population size required for its solving A^, but it provides methods for pre- 
dicting algorithmic efficiency of parallel MBOA given the problem-dependent relation 
between n and A^. See 1111 for the analysis of population sizing in BOA. 

2 Spin glass benchmark 

Finding the lowest energy configuration of a spin glass system is an important task in 
modern quantum physics. Each configuration of spin glass system is defined by the set 
of spins 

5= {.,|V/e {!,...,/'} :.,€ { + 1,-1}}, (1) 

where d is the dimension of spin glass grid and s is the size of spin glass grid. For 
the optimization by BOA the size of spin glass problem s^' is equal to the length n of a 
solution, thus 

n = s'' (2) 

Each spin glass benchmark instance is defined by the set of interactions {Jtj} between 
neighboring spins Sj and Sj in the grid. 



The energy of the given spin glass configuration S can be computed as 

E{S)= Y. JiJ^'^J (3) 

where the sum runs over all neighboring positions / and /' in the grid. For general 
spin glass systems the interaction is a continuous value Jij e< — 1,+1 >, but we fo- 
cus only on Ising model with either ferromagnetic bond Jij = — 1 or antiferromagnetic 
bond Jij = + 1 , thus J/j Q {—l,+l}. Obviously, in the case of ferromagnetic bond the 
lower (negative) contribution to the total energy is achieved if both spin are oriented in 
the same direction, whereas in the case of antiferromagnetic bond the lower (negative) 
contribution to the total energy is achieved if both spins are oriented in opposite direc- 
tions. Periodic boundary conditions are used to approximate the properties of arbitrary 
sized spin glass systems. 



3 Mixed Bayesian Optimization Algorithm 
3.1 Main principles 

The general procedure of EDA algorithm is similar to that of GA, but the classical 
recombination operators are replaced by learning and sampling a probabilistic model. 
First, the initial population of EDA is generated randomly. In each generation, promis- 
ing solutions are selected from the current population of N candidate solutions and the 
true probabihty distribution of these selected solutions is estimated. New candidate 
solutions are generated by sampling the estimated probability distribution. The new 
solutions are then incorporated into the original population, replacing some of the old 
solutions or all of them. The process is repeated until the termination criteria are met. 
Various models can be used in EDAs to express the underlying probability distri- 
bution. One of the most general models is the Bayesian network. A Bayesian network 
consists of two components. The first is a directed acyclic graph (DAG) in which 
each vertex corresponds to different variable of optimized problem which is treated as 
random variable. This graph describes conditional independence properties of the rep- 
resented distribution. It captures the structure of the optimized problem. The second 
component is a collection of conditional probability distributions (CPDs) that describe 
the conditional probability of each variable X, given its parents 11, (ancestors) in the 
graph. Together, these two components represent an unique probability distribution 

H-l 

p{Xo,...,X„-i)^Ylp{X,\Ui). (4) 

1=0 

The well known EDA instances with Bayesian network are the Bayesian Optimization 
Algorithm (BOA)(5l , the Estimation of Bayesian Network Algorithm (EBNA) fT2l 
and the Learning Factorized Distribution Algorithm (LFDA) |13| . All these algo- 
rithms use metrics-driven greedy algorithm for construction of dependency graph. The 
implementation details of Bayesian network construction can be found in 1141 . 



3.2 Advantages of decision trees 

The Mixed Bayesian Optimization Algorithm (MBOA) J7| is based on BOA with 
Bayesian networks, but it uses more effective structure for representing conditional 
probability distributions in the form of decision trees, as proposed in |6 1. It would be 
unreasonable to parallelize MBOA if we had no evidence about its merits over BOA, 
because BOA was already parallelized in several variants like PBOA and DBOA iTsl . 

There are several reasons for using decision trees. Firstly, only the coefficients for 
those combinations of II,- that significantly affect X, are estimated. With less coeffi- 
cients the value of each coefficient is - on average - estimated more precisely. Sec- 
ondly, the model building procedure allows for more precise model, since it explores 
networks that would have incurred an exponential penalty (in terms of the number of 
parameters required) and thus would have not been taken into consideration otherwise. 

MBOA can be also formulated for continuous domains, where it tries to find the 
partitioning of a search space into subspaces where the parameters seem to be mutually 
independent. This decomposition is captured globally by the Bayesian network model 
with decision trees and the Gaussian kernel distribution is used locally to approximate 
the values in each resulting partition. Without the loss of generality, in following Sec- 
tions we will focus on the performance of MBOA in binary domain. 

4 Complexity analysis 

4.1 Complexity of selection operator 

The most commonly used selection operator in EDAs is tournament selection, where 
pairs of randomly chosen individuals compete to take place in the parent population 
that serves for model learning. The number of tournaments is 0{N) and for each tour- 
nament we perform 0{n) steps to copy the whole chromozome, so the total complexity 
is 0{nN). This also holds for many other selection operators. 

4.2 Complexity of model construction 

The sequential hBOA builds all the decision trees at once. It starts with empty trees 
and it subsequently adds decision nodes. The quality of potential decision nodes is 
determined by Bayesian-Dirichlet metrics |14| which is able to determine the signifi- 
cance of statistical correlations between combinations of alleles in the population. A 
necessary condition for adding new split node is the acyclicity of dependency graph - it 
must be guaranteed that no variables Xi, Xj exist such that Xi is used as a split in the tree 
for gene Xj and at the same time Xj is used as a split in the tree for gene X,. In parallel 
MBOA it is necessary to build each decision tree separately in different processor, so 
we need an additional mechanism to guarantee the acyclicity. 

In J9]llOI we proposed the method that solves this problem. Each generation it uses 
random permutation o = (oq , o i , , o„_ i ) to predetermine topological ordering of genes 
in advance. This means that only the genes can serve as splits in the binary decision 
tree of target gene. The model causality might be violated by this constraint, but from 



the empirical point of view the quality of generated models is the same as in the se- 
quential case. The advantage is that each processor can create the whole decision tree 
asynchronously and independently of the other processors. Consequently, the linear 
speedup is achieved by removing this communication overhead. 

Function BuildTree (Population Pop, TargetVariable Xi, 

ListOfCandidateSplitVariables Pa) : DecisionTreeNode; 
Begin 

Initiate the frequency tables; ... 0(n) 

For each Variable Xj in Pa ... 0(n) 

Evaluate the metrics gain of Xj split for Xi target; ... 0(N) 
End for 

Pick up the split Xj' with the highest quality; ... 0(n) 

Popl := Selectlndividuals (Pop,"Xj' = 0"); ... 0(N) 

Pop2 := Selectlndividuals (Pop,"Xj' = 1"); ... 0(N) 

return new SplitNode (new SplitCondition ("Xj' " ) , 

BuildTree (Popl, Xi,Pa-{Xj' }) , 
BuildTree (Pop2,Xi,Pa-{Xj' }) ) ; 
End; 

To express the time complexity of whole model building algorithm, we start with 
the complexity of one run of BuildTree() procedure, which is 0{n + nN + n+N + 
N) = 0{n) + 0{nN). The total number of BuildTree() calls is 0(2''), where h is the 
average height of final decision tree, but note that the population size N is decreased 
exponentially in the recursion: 

h h h 

£2^(0(«) + 0{nN/2'')) « Y.2'\0{nN /l'') ^ £ 0{nN) = 0{hnN) (5) 

1=1 !=1 !=1 

To be precise, the time spent on initialization of frequency tables and the time spent on 
picking-up of the best split is not compensated by this exponential decrease of popula- 
tion size, since it does not depend on A^. However, in practical cases we can neglect this 
term because due to the model penalty the depth of recursion stops at some boundary 
where the population size is still reasonable such that the time required for computation 
of metrics is always higher than the time for initialization of frequency tables and for 
picking of the highest gain. The final complexity of building of whole decision tree is 
thus O(hnN) and the complexity of building whole probabilistic model composed of 
n decision trees is 0{hn^N). On various suites of optimization problems we observed 
that the depth of decision trees is proportional to log(A^). Hence, the time complexity 
of model building can be rewritten as 0{n^N\og{N)). This time complexity also holds 
for hBOA, but its model building algorithm is not intended for parallel processing. 

4.3 Complexity of model sampling 

Model sampling generates 0{N) individuals of length n, where each gene is generated 
by traversing down the decision tree to the leaf. On average, it takes 0{h) = 0{log{N)) 
decisions before the leaf is reached. Thus, the overall complexity of model sampling is 
0{nNlog{N)). 



4.4 Complexity of replacement operator 

For the MBOA algorithm we use the Restricted Tournament Replacement (RTR) to 
replace a part of target population by generated offspring. RTR was proposed in (5) 
and its code can be specified as follows: 

For each offspring individual from offspring population ...0(N) 
For a*N randomly chosen individuals from target pop. ...0(a*N) 
Compute the distance between chosen individual 

and offspring individual; ...0(n) 

End for 

If fitness of offspring individual is higher than the fitness 
of chosen individual with closest distance then 
Replace the chosen individual by offspring individual; .. .0 (n) 
End if 
End for 

The whole time complexity is then 0{N{anN + n)) = 0{anN^), where the coeffi- 
cient a determines the percentage of randomly chosen individuals in target population 
that undergo the similarity comparison with each offspring individual. The greater the 
a, the the stronger the pressure on diversity preservation. Note that the complexity of 
RTR exhibits the complexity of most other existing replacement operators. 

4.5 Complexity of fitness evaluation 

In the experimental part of this paper we use the spin glass benchmark (defined in 
Section |2j as a real-world example of an NP complete optimization problem. The 
evaluation time of one spin glass configuration is linearly proportional to the number 
of bonds, which is linearly proportional to the number of spins 0{n). For A^ individuals 
the evaluation time grows with 0{nN). This complexity also holds for decomposable 
deceptive functions and many artificial benchmarks. 

Optionally, MBOA can use hill-climbing heuristics to improve the fitness of each 
individual. This heuristics tries to flip each bit and the change with the highest fit- 
ness increase is picked and accepted. This improvement is repeated until no flipping 
with positive outcome exists. Empirically the number of successful acceptances per 
chromozome is 0{^/n) and after each acceptance it takes 0(1) time to re-compute the 
outcomes of neighboring flips and 0{n) time to pick the best flip for next change. The 
total complexity of hill climbing heuristics for the whole population is thus 0{nN^/n). 
The algorithm for picking the best flip for next change can be implemented even more 
effectively, so the total complexity 0{\og{n)Nn) can be achieved. In our analysis we 
consider the case without heuristics, but we see that the complexity of heuristics only 
sUghtly prevails the complexity of spin glass evaluation without heuristics. 



5 Scalability analysis 

The whole execution time of one generation of MBOA algorithm can be formed by 
summing up the complexity of selection (weighted by ci), model building (weighted 
by C2), model sampling (weighted by C3), replacement (weighted by 04), and fitness 
evaluation (weighted by C5): 



ciO{nN)+C20{n^mog{N))+C30{nmog{N))+C40{anN^)+C50{nN) (6) 

To develop scalable and efficient parallel Mixed Bayesian Optimization Algorithm 
we need to identify the main tasks that are candidates for parallelization. 

Apparently, the most complex part of sequential MBOA is the construction of prob- 
abilistic model. Also our experience with solving spin glass benchmarks confirms this 
statement - typically nearly 95 % of the execution time is spent on construction of prob- 
abilistic model. In Section l4!2l we outlined an algorithm for parallel construction of 
Bayesian network with decision trees proposed in ll^ llOI . It uses the principle of re- 
stricted ordering of nodes in graphical probabilistic model. This method reduces the 
communication between processors, so the speedup is almost linear and the model 
building is able to effectively utilize up to P = n/2 processors. 

As the first case of our analysis, we consider only this parallelization of probabilis- 
tic model construction and keep the other parts sequential. 

With P processors the overall time complexity of parallel MBOA is 

ciO{nN) + -C20{n^N\og{N))+C30{nNlog{N))+C40{anN^)+C50{nN) (7) 
Now we will analyze the proportion between sequential and parallel part of MBOA: 

ci 0{nN) + cj.O{nNlog{N)) + C40(anN^) + csO{nN) 



^C20{n'^N\og{N)) 



(8) 



To obtain an algorithm that is efficiently scalable, this proportion should approach 
zero as n grows. 

5.1 Scalability for fixed number of processors 

We first analyze scalability in the case of constant P and increasing n. This is the typical 
scenario when the computational resources are fixed but the problem to be solved is 
very large. The detailed analysis of terms in fraction (|8|i for P = const., n ^ °o gives 
us the following suggestions for design of parallel MBOA: 

• The terms with ci and C5 are negligible for scalability: 

^.^cMnNHcsOjnN) ^^ 
«^~ ^C20{n^Nlog{N)) 



In another words, neither the selection operator nor the population evaluation 
have to be implemented in parallel. Of course this outcome is valid only for 
population evaluation with time complexity 0{nN). For quadratic complexity of 
population evaluation 0{n^N) the scalability will depend on the absolute values 
of constants C5, C2 and on the problem-dependent relation between "n" and "N". 
Theoretically, if the population size grows linearly with problem size (A' o^ n) 
it is still possible to keep the fitness evaluation sequential, because the log(A^) 
term in the denominator prevails. Nevertheless, in practical situations we sug- 
gest parallel evaluation of problems with quadratic and higher complexity. The 
parallelization of fitness evaluation is solved problem and its implementation is 
straightforward. For more detailed discussion of parallel fitness evaluation see 

M- 

• The sampling of model does not have to be performed in parallel, since for all 
possible assignments to constants cj, C3 and P it always holds 

hm ;30(nA.log(A^)) ^, 

• The Restricted tournament replacement has not to be performed in parallel if 

lim -i ^-4 = (11) 

«^~ ^C20yNlog{N)) 

In this case the scalability highly depends on the problem-dependent relation 
between n and A'. Theoretically, even if the population size grows linearly with 
the problem size (A^ <=<: «) the fraction tends to zero because the log(A^) term in 
the denominator prevails. However, in practical situations we suggest RTR to be 
performed in parallel. 

5.2 Scalability for increasing number of processors 

So far we have analyzed the scalability of sequential BOA for fixed number of pro- 
cessors. Now we will analyze how the scalability changes if the number of available 
processors scales up with n. In this case the execution time is reduced by an order of n. 
By assuming P °^ n,we obtain from Equation (|8ji: 

cxO[nN) + C3C>(nA^log(A^)) + c^OiariN^) + csO{nN) 
C20(«A^log(A^)) 

We see that the selection operator and the simple evaluation of population (terms 
with constants c\ and C5) can still be implemented sequentially, but it does not hold 
for fitness evaluation with quadratic and higher complexity any more. The decision 
about implementation of model sampling strongly depends on the required speedup. 
If sequential model sampling is performed, then the speedup is saturated at C2/C3. 
RTR has to be necessarily implemented in parallel, because for fixed C4, C2, and a the 
numerator always prevails the denominator. 



6 Experimental results 

6.1 Fitting complexity coefficients 

We performed a series of experiments on the random instances of 2D Ising spin glass 
benchmarks of size 100, 225, 400, 625, 900 for population sizes A^ = 500, A^ = 1000, 
A^ = 1500, A^ = 2000, A^ = 4000, A^ = 6000 and A^ = 8000. We measured sepai-ately 
the duration of each part of sequential MBOA in order to determine the coefficients 
ci,C2,C3,C4,C5. The fitted coefficients are stated in Tab. [2 We observed that for larger 
problem sizes the fitting is in agreement with the empirical data, but for lower problem 
sizes the measured time is smaller than that expected from the theoretical complexity. 
This can be explained by the effects of cache memory. 



MBOA part 


Coefficient 


Estimated value 


R- squared value 


selection 


Cl 


8.73E-09 


0.978 


model building 


C2 


l.OOE-07 


0.979 


model sampling 


C3 


1.58E-07 


0.934 


replacement (RTR) 


C4*a 


2.18E-10 


0.989 


evaluation 


C5 


1.34E-07 


0.918 



Table 1: The resulting values of coefficients ci,C2,C3,C4,C5. For each Ising spin glass 
size 100, 225, 400, 625, 900 and each population size A^ = 500, A^ = 1000, A^ = 1500, 
A^ = 2000, A^ = 4000, N = 6000, and N = 8000 we choose 10 random benchmark 
instances and averaged the duration of each MBOA part. The coefficients hold for one 
generation of MBOA performed on Intel Pentium-4 at 2.4 GHz. The slopes of linear 
trend lines were determined in MS Excel. 



6.2 Using complexity coefficients 

The obtained coefficients can be used to predict the speedup of MBOA with parallel 
model building. Given the problem size «, the population size A^, and the number of 
processors P, we get: 



S = 



c\nN + C2n'^Nlog{N) + cjnNlog{N) + C4anN^ + c^nN 
c\nN + jiC2rfiN\og{N) +c^nN\og{N) + c^anN^ + c^nN 



(13) 



Fig. n shows how the predicted speedup changes for increasing P and compare 
it with the speedup computed from the measured duration of each part of sequential 
MBOA. We considered 3 different sizes of spin glass instances 20x20, 25x25, and 
30x30 and we linearly increased the population size with problem size (A^ = 4000, 
A^ = 6000, A^ = 8000). As one can see, the predicted speedup fits nicely the empirical 
speedup, namely for large problem size where the caching effects disappear Also, it 
can be seen that it is possible to use larger number of processors (more than P = 50) 
without observing any significant speedup saturation. 



41 



V, 36 



31 - 



26 



21 

16 
11 



-^>^ speedup for 2D spin glass 20x20, N=4000 
-♦■ - predicted speedup for 2D spin giass 20x20, N=4000 

— •— speedup for 2D spin giass 25x25, N=6000 
-Zy - predicted speedup for 2D spin giass 25x25, N=6000 

-A— speedup for 2D spin giass 30x30, N=8000 
■D - predicted speedup for 2D spin giass 30x30, N=8000 




Figure 1: The comparison of the speedup predicted from the numerical fit and the 
speedup computed from the empirical data measured on sequential BOA solving 2D 
Ising spin glass instances of size 20x20, 25x25, and 30x30. Population size scales 
approximately linearly with the problem size. 

Note that Equation J13> assumes that the model building is ideally parallelized. 
This is why we were able to calculate the speedup using the time measurements from 
sequential MBOA. In practical situations the communication time required by master 
to communicate the population and to gather the parts of the model from slave pro- 
cessors has to be considered as well. In this analysis we neglect this term because the 
implementation of message passing interface (MPI) might be platform-dependent and 
its theoretical time complexity is not known. The speedup measured on the implemen- 
tation of parallel MBOA using Beowulf cluster of 502 Intel Pentium III computational 
nodes is shown in Fig. |2]obtained from I.10J . 

7 Conclusions and future work 

We analyzed the time complexity of Mixed Bayesian Optimization Algorithm and fit- 
ted this complexity to experimental data obtained by solving spin glass optimization 
problem. The empirical results fit well the theoretical time complexity equation, so the 
scalability and algorithmic efficiency of parallel Mixed Bayesian Optimization Algo- 
rithm can be predicted, e.g. the speedup for given spin glass size and given number of 
processors can be determined. 

Furthermore, we derive the guidelines that can be used to design effective parallel 
Estimation of Distribution Algorithms for arbitrary optimization problems where the 
relation between problem size and the minimal population size required for solving the 
problem is known. Especially, we focus on the identification of the parts of MBOA 



10 



50 
45 
40 
35 
30 
25 



-All hBOA parts, n=900, N=500 
-All hBOA parts, n=900, N=2000 
-All hBOA parts, n=900, N=S000 
-Model construction, n=900, N=500 
-Model construction, n=900, N=2000 
-Model construction, n=900, N=8000 




Figure 2: Speedup of the whole parallel MBOA and the speedup of model building part 
(including true MPI communication) on random spin glass instance with 30x30 spins. 

which have to be implemented in parallel. 

So far the model building was the only part considered to be performed in parallel, 
but our analysis identified that under some circumstances the parallelization of other 
parts of MBOA might be required as well. For example, the parallel fitness evaluation 
(see 1 16 1) can be implemented for problems with expensive fitness evaluation. With 
careful parallelization of all necessary MBOA parts the achieved speedup is propor- 
tional to the problem size. The implementation of all critical MBOA parts in parallel 
will be a subject of future work. 
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